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Abstract. Bayesian methods are now one of the main tools for inferring phylogenetic trees 
from aligned sequence data. A recent paper [3] has suggested that Bayesian tree inference 
might be statistically inconsistent, at least for certain combinations of branch lengths (in 
part of the 'Felsenstein Zone'), on the basis of their simulation results. We mathematically 
demonstrate that Bayesian methods for inferring resolved trees are statistically consistent over 
the entire parameter range of branch lengths, under general conditions on priors, at least for 
simple models where the usual 'identifiability' conditions hold. 



1. Introduction 

Bayesian Inference (BI) has become a mainstream approach for inferring phylogenetic tree 
topology from aligned DNA sequence data [I]. The approach has a number of desirable features, 
however some potential problems with the convergence of Bayesian methods (MCMC) have been 
highlighted [5] . The paper [3] investigated the performance of Bayesian inference (BI) for inferring 
phylogenetic tree topology from aligned DNA sequence data under various models. Among the 
interesting findings in their simulation study was that Bayesian methods applied to unresolved 
4-leaf trees (with a zero-length interior edge) with certain combinations of long/short pendant 
branches tended to show increasing bias towards one of three particular resolved trees as the 
sequence length increased, in contrast to maximum likelihood which favored each of the three 
resolutions equally (c.f. [8]). 

The authors suggested the possibility that for data generated by a resolved tree, with a certain 
combination of short and long edges, BI might even be statistically inconsistent (i.e. the tree 
with the highest posterior probability for the data being different to the tree that generated 
the data, with a probability that does not tend to zero as the sequence length grows) even for 
models for which maximum likelihood is known to be statistically consistent pQ. Noting that 
"our finding provide strong, albeit circumstantial evidence that BI is statistically inconsistent," 
the authors of [3J were careful to point out that a formal mathematical analysis of this possibility 
was beyond the scope of their paper. 

In this short paper we provide a self-contained formal proof that BI is statistically consistent for 
inferring the topology of resolved phylogenetic trees over the entire range of branch lengths, and 
under essentially the same identifiability conditions that establish the consistency of maximum 
likelihood estimation of tree topology. We do this by establishing a more general result that 
includes the phylogenetic setting as a particular case. 

1.1. General result. Consider the general problem of identifying a discrete parameter lying in 
an arbitrary finite set A from a sequence of i.i.d. observations that take values in an arbitrary 
finite set U. Suppose further that the probability distribution on U is determined not just by 
the discrete parameter a £ A, but by some additional ('nuisance') parameters. In this paper we 
will assume in that these additional parameters are continuous, and if we denote the parameter 
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space associated with each discrete parameter a £ A by 9(a) that this is an open subset of some 
Euclidean space. 

In the phylogenetic setting, A is the set of fully resolved (binary) phylogenetic tree topologies 
on a given leaf set, U is the set of possible site patterns, and the parameter set 0(a) specifies for 
the tree topology a all its possible branch lengths, each of which lies in the range (0, oo). Thus 
9(a) = (0, oo) 2n ~ 3 where n is the number of leaves of tree a. 

Let P( a ,6) denote the probability distribution on some finite set U determined by the discrete- 
continuous parameter pair (a, 6). Suppose we have a discrete (prior) probability distribution ir 
on A, and, for each a £ A, a continuous (prior) probability distribution on 9(a) with probability 
density function f a {ff). We will suppose that the following conditions hold for all aeA 

(CI) tt(o) > 0; 

(C2) The density f a (&) is a continuous, bounded and nonzero on 9(a); 

(C3) The function 9 i— > P( a ,9) is continuous and nonzero on 9(a); 

(C4) For all 9 £ 9(a), and all b ^ a, we have: infg/ge^) d(p^$),pg, t gr)) > ®> 
where in (C4) and henceforth, d denotes the L\ metric - that is, for any two probability distri- 
butions p, q on U: 

d(p,q) ■■= ^2 \p(u) - q(u)\. 
ueu 

In the phylogenetic setting, if n is any of the usual non-zero priors on binary phylogenetic trees 
(e.g. the uniform (PDA) distribution, or the Yule distribution), then condition (CI) is satisfied. 
If we take the usual exponential prior on branch lengths then condition (C2) is satisfied. For all 
Markov processes on trees condition (C3) holds (the nonzero condition holds since in any tree 
with positive length pendant edges all site patterns have a strictly positive probability). Finally, 
for all models for which identifiability holds (eg. the GTR model or any submodel down to the 
most restrictive the Jukes-Cantor 1969 model) condition (C4) holds (see e.g. i_7[; a specific lower 
bound on d for the 2-state symmetric model is provided via Lemma 7.3 of [6]). 

Now, suppose we are given a sequence u = (u\, . . . , u%) £ U k generated i.i.d. by some unknown 
pair (a, 9) and we wish to identify the discrete parameter (a) from u given prior densities on 
A and the continuous parameters. The Maximum a-Posteriori (MAP) estimator selects the 
element b £ A that maximizes the posterior probability of b given u - that is, it maximizes 
7r(6)E^[P(u|6,0')], where: 

fe 

(1) P(u|&,0) = l[P(b,e>)(ui), 

i=i 

is the probability of generating the sequence of i.i.d. observations (ui, . . . ,Uk) from underlying 
parameters (b,9'), and where Eg/ refers to expectation with respect to the prior probability 
distribution on 9(o). 

Let P(a, 9, k) denote the probability that, for a sequence u%, . . . , Uk generated i.i.d. by (a, 9), 
the MAP estimator correctly selects a. The following theorem establishes a sufficient condition 
for the statistical consistency of the MAP estimator in this context. 

Theorem 1. Provided conditions (C1)-(C4) hold for all a £ A, then limfe-i.oo P(a, 9, k) = 1 for 
all a £ A, and 9 £ 9(a). 
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/(e, 5) := i(e x - <5) 2 - 25 • (| log(e 2 - <5)| + | log(e 3 )|). 



s(u) 



2. Proof of Theorem Q] 

We first present a lemma that is helpful in the proof of Theorem [TJ 

Lemma 2. For any 61,62,63 > there exists 5 > cmd e > so that the following holds: For 
any finite set U , and probability distributions p, q, r,s on U that satisfy the three conditions: 

(i) d(p,q) > ei ; 

(ii) p(u) > 62, q{u) > 63 for all u G U ; 
(hi) d(p, r) < S and d(p, s) < 5; 

we have: 

(2) ErMlog (£W)> e . 

Proof. We will show that any 5 > and e > for which /(e, S) > e > suffices for @, where: 

/M) :=i(6i-<5) 2 -2, 
For each u £ U, let A„ := r{u) — s(u). Then: 

W EK»)iog(^) = E«(»)'» e (^) + E^'»e 

net/ vyv u ueu vyv ;/ u£U 

Now, the first term on the right hand-side of © is simply the Kullback-Leibler separation of 
s and q and, by Pinsker's Inequality [2], this is bounded below by ^d(s,q) 2 . Moreover, by the 
triangle inequality, d(s, q) > (p, q) — d(s,p) > ei — S (by conditions (i) and (hi)), and so the first 
term on the right of © is bounded below by \{e\ — S) 2 . 

Concerning the second term on the right of (0), its absolute value is bounded above by: 

J2 |A„| max I log ( ^ ) | = d(r, s) ■ max | log ( ^ 

Again invoking the triangle inequality, d(r, s) < d(r,p) + d(p, s) < 26 (by condition (hi)); more- 
over, since s{u) > p(u) — 6 > 62 — 5, and q(u) > 63 (both by condition (hi)): 

/ s(u)\ 

max I log — - I < max | log(s(w))| + max | log(q(u))\ < | log(e 2 - S)\ + | log(e 3 )|. 
ueu \<l{ u ) ) u ^ u «e(/ 

Applying this (and the earlier) bound to ([3]) gives: J2 u eu r ( u )^°& (g(«y) — /( e '^)' as required. 

□ 

Turning to the proof of Theorem [TJ we suppose throughout that the sequence u = Ux, ■ ■ ■ , Ufe 
is generated i.i.d. by (a, 9$) where 9q is any particular element of 0(a). Then the MAP estimator 
will correctly select a from u if and only if the Bayes Factor defined by: 

7r(g)E 9 [P(u|M)] 

a/b 7r(&)E e ,[P(u|M')] 

is strictly greater than I for all b ^ a. By the Bonferroni inequality, it suffices to show that for 
each b ^ a the probability that u is such that BF a / b > 1 tends to 1 as k grows. To achieve this 
we first observe that BF a / b = ^fef • R a /b where: 

(A , p E fl [P(u|a,0)] 

(4) H a / b ■ = 



E e ,[P(u|&,0')]' 

and where -^jfy, is a strictly positive by (CI). Thus it suffices to show that, for each b ^ a and 
for any finite constant M, the inequality R a /b > M holds with a probability that tends to 1 as 
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k tends to infinity. We will establish this inequality by providing an explicit lower bound to the 
numerator of R a /b and an explicit upper bound to the denominator of R a /b, and showing that, 
with probability tending to 1 as k grows, their ratio exceeds M. 

Before describing the lower bound, observe that we can re-write Eqn. {1} as follows: 



(5) P(u|M) = II P(M)(«) n " = II *CM)(«) 



r(u)k 

ueu ueu 



where, for each u S U, n u := \{j : Uj = u}\, and r{u) := \n u . Note that r = (r(u) : u G U) 
is a (empirical) probability distribution on U (i.e. its entries are nonzero and sum to 1). In the 
phylogenetic setting r describes the frequency of site patterns in the data. 

For the lower bound on the numerator of R a /b, consider the subset N T of 0(a) consisting of 
a closed ball centered on 9q and of radius r > 0. Note that we can always select a sufficiently 
small value of r > for which N T C 0(a) by the assumption that 9(a) is an open subset of 
some Euclidean space. Letting /j,(N t ) — J N f a (9)d9 > we have: 

(6) E e [P(u|M)] = / F(xi\a,0)f a (6)d6> [ P(u|o, 9)f a (9)d9 > (i(N T ) ■ inf {P(u|o, 9)}. 

Now, let s be the probability distribution of the form P( a ,e) that minimizes P(u|a, 9) when 9 
is restricted to N T ; such a distribution s exists from the compactness of N T and the continuity 
condition of (C3). Then, from © we have: inf eeJ v T {P(u|a, &)} = Y[ u£U s{u) r ^ k . Applying this 
to © gives: 



(7) EflpP(u|M)] >/x(iV T ). Y[s(u) r ^ 



ueu 

Regarding the upper bound on the denominator of R a /b, we have: 



(8) E 6 >[¥(u\b,9')}< sup {¥(u\b,9% 

6'ee(b) 

Given u, let 9i be a sequence of elements of &(b) for which lim^oo P(u|6, 9i) — sup 6/ / e0( - b - ) {P(u|6, 9')}. 
Then p^fiAi * > 1, is a sequence in a bounded subset of Euclidean space (the probability simplex) 
and so, by the Bolzano-Weierstrass theorem, it has a convergent subsequence, with limit q (a 
probability distribution on U). 

From Eqns. Q, Q and © we have: 

a/b ~ Ylueu^Y {u)k 

and so: 



(9) hg(R a/b ) > log(/i(iV T )) + k r(u) log 

ueu 

In particular, for any positive constant M > 1, we have R a /b > M as k — > oo provided that, for 
some e > that does not depend on k or t: 

(10) EK«)log(^)>e>0. 

Definitions: For p > let E p be the event that u is such that d(p, r) < p, where r is the 
empirical distribution determined by u and defined after Eqn. (|S|). Let p := P( a ,e )j and let 
e 2 := mm{p(u) : u e U} > (by (C3)). 

Claim: For p = ^62, if u satisfies E p then q(u) > £3 for some 63 > determined by £2- 
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Proof of Claim. By condition (C3) (applied to b) if we select any B\ G Q(b) then for some v > 
we have P(b,e 1 )( u ) > v f° r a ll u G U, and so: 

(11) P(u|Mi) > zA 

Now, if g(u) < £3 for at least one u £U, then sup / g0 ( f) ){P(u|&, 0')} < q(u) n ™ < eg". Moreover, 
if u satisfies £? p , then n„ > k(p(u) — p) and, since p(u) > e and p — e/2 we have: 

(12) sup {P(u|M')} < 4' (p(tl) " p) < £g e2/2 . 

9'ee(6) 

However if £3 is small enough that: £ 3 2 ^ 2 < v then (jlll) and (fT2j) imply P(u|6, 0i) is strictly 
greater than sup e , ee / b \{P(u|6, 6*')}, a contradiction. Thus for such a small value of £3 we have 
q(u) > £3 for all u € U, which establishes the Claim. 

Returning to the proof of Theorem [TJ we are now in a position to apply Lemma [5J For a fixed 
b 7^ a, let £1 = iT^e' £&(b){d(p, P(b.e'))} ■ By (C4), £1 > 0, and since q is either of the form P(b,6'), 
or is the limit of such distributions as 9' S 8(6) varies, we have: 

d(p,q) > £1 > 0. 

Recall that £2 = min-fjj(u) : it e U} > by (C2). Set p — so that if u satisfies E p then 
(by the Claim), q(u) > £3 > 0. Applying Lemma [5] there exists e > and 6 > (dependent just 
on £1, £2, £3) so that, provided u satisfies the event E e2 / 2 (in order for condition (hi) to hold for 
q), and provided d(p,r),d(p, s) < 5, then Inequality (fTUl) holds. Now, the event that d(p,r) < S 
is precisely the requirement u satisfies event Eg, and for any S > this event has probability 
converging to 1 as k grows, by the law of large numbers (the same comment applies for the 
requirement that u satisfies E^m). Moreover, by the continuity condition in (C3), we can select 
r > sufficiently small so that d(p, s) < S. Although a small r inflates the magnitude of the 
negative first term (log(A r )) on the right of ©, the second term is positive (by (ITU)) ) and grows 
linearly with k. Thus, for any finite value M, R a /b > M with probability converging to 1 as 
k — > 00. By the comments following this completes the proof. □. 

Remark In the proof, notice that only the probability distribution p = P( a ,e ) is fixed, the 
other three distributions r, s, q depend on the data u that is generated by p. However, Lemma[2] 
quantifies over all choices of r, s, q once the e; values have been specified, and these, in turn, 
depend ultimately just on p. 
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